Regular Expressions in R

Author

Martin Schweinberger

Published

January 1, 2026

Introduction

This tutorial introduces regular expressions (regex) and demonstrates how to use them when working with language data in R. A regular expression is a special sequence of characters that describes a search pattern. You can think of regular expressions as precision search tools — far more powerful than simple find-and-replace — that let you locate, extract, validate, and transform text based on its structure rather than its exact content.

Regular expressions have wide applications across linguistics and computational humanities: searching corpora for inflected forms, extracting named entities, cleaning OCR output, tokenising text, validating annotation schemes, and building text-processing pipelines. Once mastered, they become one of the most versatile tools in any language researcher’s toolkit.

Learning Objectives

By the end of this tutorial you will be able to:

  1. Explain what a regular expression is and how it differs from a simple string search
  2. Construct patterns using literal characters, the wildcard ., anchors, character classes, and POSIX classes
  3. Apply quantifiers — including greedy and lazy variants — to specify repetition
  4. Use capturing groups, non-capturing groups, and alternation
  5. Use shorthand escape sequences (\w, \d, \s) and understand the double-backslash requirement in R
  6. Write lookahead and lookbehind assertions for context-sensitive matching
  7. Apply the key stringr functions — str_detect(), str_extract(), str_replace(), and others — with regular expressions
  8. Use regular expressions for practical corpus tasks: concordance searches, text cleaning, metadata extraction, and frequency analysis
  9. Integrate regex with dplyr pipelines for filtering and annotation

Prerequisite Tutorials

Before working through this tutorial, please complete or familiarise yourself with:

Citation

Martin Schweinberger. 2026. Regular Expressions in R. The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia. url: https://ladal.edu.au/tutorials/regex/regex.html (Version 2026.03.28).

External Resources

For further study, the following resources are highly recommended:


Preparation and Session Set-up

Install required packages (once only):

Code
install.packages("stringr")
install.packages("dplyr")
install.packages("flextable")
install.packages("checkdown")

Load packages:

Code
library(stringr)     # string manipulation and regex functions
library(dplyr)       # data frame manipulation
library(flextable)   # formatted tables
library(checkdown)   # interactive exercises

options(stringsAsFactors = FALSE)  # default since R 4.0; kept for older R versions
options(scipen = 100)              # avoid scientific notation in output

We will work with two types of objects throughout: a short example sentence for demonstrating individual patterns, and a longer example text representing realistic corpus data.

Code
# Short example sentence for basic demonstrations
sent <- "The cat sat on the mat."

# A longer example text: an excerpt about linguistics
et <- paste(
  "Grammar is the system of a language. People sometimes describe grammar as",
  "the rules of a language, but in fact no language has rules. If we use the",
  "word rules, we suggest that somebody created the rules first and then spoke",
  "the language, like the rules of a game. But languages did not start like",
  "that. Languages started when humans started to communicate with each other.",
  "Grammars developed naturally. After some time, people described the grammar",
  "of their languages. Languages change over time. Grammar changes too.",
  "Children learn the grammar of their first language naturally. They do not",
  "need to study it. Native speakers know intuitively whether a sentence is",
  "grammatically correct or not. Non-native speakers often learn grammar rules",
  "formally, through instruction. Prescriptive grammar describes how people",
  "should speak, while descriptive grammar describes how people actually speak.",
  "Linguists study grammars to understand language structure and acquisition.",
  "The field of syntax deals with sentence structure, while morphology examines",
  "how words are formed. Phonology studies sound systems in human languages.",
  "Pragmatics investigates how context influences the interpretation of meaning.",
  "Computational linguistics applies formal grammar to natural language processing.",
  "Regular expressions are useful tools for searching and extracting patterns.",
  "They can match words like 'cat', 'bat', or 'hat' with a single pattern."
)

# Split the text into whitespace-delimited tokens (punctuation stays attached)
tokens <- str_split(et, "\\s+") |> unlist()

Regular Expression Patterns

Section Overview

What you will learn: The building blocks of regular expressions — how each type of pattern works and what it matches.

Key concept: Regular expressions describe structure, not content. [aeiou]{2,} matches any sequence of two or more vowels, regardless of which vowels or in which word.

Basic characters

The simplest regular expression is a literal character — it matches exactly that character. A sequence of literal characters matches that exact sequence:

Code
# Literal match: does "cat" appear in the sentence?
str_detect(sent, "cat")
[1] TRUE
Code
# The dot . matches ANY single character except newline
str_detect(sent, "c.t")    # matches "cat"
[1] TRUE
Code
str_detect(sent, "m.t")    # matches "mat"
[1] TRUE
Code
str_detect(sent, ".at")    # matches "cat", "sat", "mat"
[1] TRUE

To match a literal dot (rather than “any character”), escape it with a double backslash:

Code
# Match a literal period at the end of the sentence
str_detect(sent, "\\.")    # TRUE — the sentence ends with a full stop
[1] TRUE
Code
# Without escaping, . matches any character:
str_detect("abc", ".")     # TRUE — any character matches
[1] TRUE
Code
str_detect("abc", "\\.")   # FALSE — no literal dot in "abc"
[1] FALSE
The double backslash in R

In most programming languages, a single backslash \ is the regex escape character. In R strings, \ itself must be escaped, so regex escapes require double backslash \\. For example:

  • \\. in R code → \. as a regex → matches a literal dot
  • \\b in R code → \b as a regex → matches a word boundary
  • \\d in R code → \d as a regex → matches a digit

This double-backslash requirement catches many beginners. Remember: every \ you intend for regex needs to be written as \\ in R.
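A quick way to see what the regex engine actually receives is to print the string with writeLines(), which resolves R's string escapes:

```r
# "\\d" is two characters in memory: a backslash and a "d"
writeLines("\\d")   # prints: \d
nchar("\\d")        # returns 2
```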

Anchors

Anchors match positions in the string, not characters. They constrain where in the string a pattern can match.

Code
# ^ matches the START of the string
str_detect(sent, "^The")     # TRUE — "The" is at the start
[1] TRUE
Code
str_detect(sent, "^cat")     # FALSE — "cat" is not at the start
[1] FALSE
Code
# $ matches the END of the string
str_detect(sent, "mat\\.$")  # TRUE — "mat." is at the end
[1] TRUE
Code
str_detect(sent, "cat$")     # FALSE — "cat" is not at the end
[1] FALSE
Code
# \b matches a WORD BOUNDARY (between a word char and a non-word char)
str_detect("catalogue", "\\bcat\\b")   # FALSE — "cat" is part of a word
[1] FALSE
Code
str_detect("the cat sat", "\\bcat\\b") # TRUE — "cat" is a whole word
[1] TRUE
Code
# \B matches where \b does NOT (i.e., inside a word)
str_detect("catalogue", "\\Bcat\\B")   # FALSE — "cat" is at word START
[1] FALSE
Code
str_detect("concatenate", "\\Bcat\\B") # TRUE — "cat" is in the middle
[1] TRUE
Word boundaries in corpus searches

\b is indispensable for corpus searches. Without it, searching for “the” would match the sequence inside “other”, “there”, “weather”, and so on. Always use \\bword\\b when you want whole-word matches.
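For example, compare a bare pattern with a boundary-anchored one:

```r
library(stringr)

words <- c("the", "other", "there")

str_detect(words, "the")          # TRUE TRUE TRUE: matches inside longer words
str_detect(words, "\\bthe\\b")    # TRUE FALSE FALSE: whole-word matches only
```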

Character classes

A character class [...] matches any single character from the set listed inside the brackets:

Code
# Match 'c', 's', or 'm' followed by 'at'
str_extract_all(sent, "[csm]at")
[[1]]
[1] "cat" "sat" "mat"
Code
# Negated class [^...]: match any character NOT in the set
str_extract_all(sent, "[^aeiou ]")   # non-vowel, non-space characters
[[1]]
 [1] "T" "h" "c" "t" "s" "t" "n" "t" "h" "m" "t" "."
Code
# Ranges
str_extract_all("Hello World 123", "[a-z]")    # lowercase letters
[[1]]
[1] "e" "l" "l" "o" "o" "r" "l" "d"
Code
str_extract_all("Hello World 123", "[A-Z]")    # uppercase letters
[[1]]
[1] "H" "W"
Code
str_extract_all("Hello World 123", "[0-9]")    # digits
[[1]]
[1] "1" "2" "3"
Code
str_extract_all("Hello World 123", "[a-zA-Z]") # all letters
[[1]]
 [1] "H" "e" "l" "l" "o" "W" "o" "r" "l" "d"

POSIX character classes

R supports POSIX character classes — named sets written as [:name:] inside an outer [...] bracket expression:

Code
str_extract_all("Hello, World! 123.", "[[:alpha:]]")  # letters only
[[1]]
 [1] "H" "e" "l" "l" "o" "W" "o" "r" "l" "d"
Code
str_extract_all("Hello, World! 123.", "[[:digit:]]")  # digits only
[[1]]
[1] "1" "2" "3"
Code
str_extract_all("Hello, World! 123.", "[[:punct:]]")  # punctuation only
[[1]]
[1] "," "!" "."
Code
str_extract_all("Hello, World! 123.", "[[:alnum:]]")  # letters and digits
[[1]]
 [1] "H" "e" "l" "l" "o" "W" "o" "r" "l" "d" "1" "2" "3"
Code
str_extract_all("Hello\tWorld  123",  "[[:blank:]]")  # spaces and tabs
[[1]]
[1] "\t" " "  " " 

The full set of POSIX classes available in R:

Class       Matches
[:alpha:]   Any letter (a-z, A-Z)
[:lower:]   Lowercase letters (a-z)
[:upper:]   Uppercase letters (A-Z)
[:digit:]   Digits (0-9)
[:alnum:]   Letters and digits
[:punct:]   Punctuation: . , ; : ! ? " ' ( ) [ ] { } / \ @ # $ % ^ & * - _ + = ~ ` |
[:space:]   All whitespace: space, tab, newline, return, form-feed
[:blank:]   Space and tab only
[:graph:]   All visible characters (alnum + punct)
[:print:]   Printable characters (graph + space)
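POSIX classes can also be sequenced to describe word shapes. For instance, a capitalised word is one uppercase letter followed by one or more lowercase letters:

```r
library(stringr)

str_extract_all("The cat sat on the Mat in May",
                "\\b[[:upper:]][[:lower:]]+\\b")[[1]]
# [1] "The" "Mat" "May"
```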

Quantifiers

Quantifiers specify how many times the preceding element should match. The table below gives a complete overview:

Quantifier   Meaning                                     R example       Example matches
*            0 or more (greedy)                          "b*"            "" "b" "bbb"
+            1 or more (greedy)                          "b+"            "b" "bbb"
?            0 or 1 — makes element optional (greedy)    "colou?r"       "color" "colour"
{n}          Exactly n times                             "[a-z]{5}"      "hello" "world"
{n,}         n or more times (greedy)                    "[a-z]{3,}"     "cat" "grammar"
{n,m}        Between n and m times (greedy)              "[a-z]{3,6}"    "cat" "gram" "syntax"
*?           0 or more (lazy — as few as possible)       "<.*?>"         first tag only in "<b>bold</b>"
+?           1 or more (lazy — as few as possible)       "<.+?>"         first tag only in "<b>bold</b>"
??           0 or 1 (lazy)                               "colou??r"      "color" "colour"
{n,m}?       Between n and m times (lazy)                "[a-z]{3,6}?"   shortest run of 3-6 letters

Code
# * : 0 or more
str_extract_all("aabbbcccc", "b*")   # matches "", "bbb", ""...
[[1]]
[1] ""    ""    "bbb" ""    ""    ""    ""    ""   
Code
# + : 1 or more
str_extract_all("aabbbcccc", "b+")   # matches "bbb"
[[1]]
[1] "bbb"
Code
# ? : 0 or 1 (makes the element optional)
str_detect(c("color", "colour"), "colou?r")   # both TRUE
[1] TRUE TRUE
Code
# {n} : exactly n
str_extract_all(et, "[a-z]{10}")     # 10 lowercase letters (longer runs are cut at 10)
[[1]]
 [1] "communicat" "intuitivel" "grammatica" "instructio" "rescriptiv"
 [6] "descriptiv" "understand" "acquisitio" "morphology" "investigat"
[11] "influences" "interpreta" "omputation" "linguistic" "processing"
[16] "expression" "extracting"
Code
# {n,} : n or more
str_extract_all(tokens, "^[[:alpha:]]{8,}$")  # words of 8+ letters
[[1]]
character(0)

[[2]]
character(0)

[[3]]
character(0)

[[4]]
character(0)

[[5]]
character(0)

[[6]]
character(0)

[[7]]
character(0)

[[8]]
character(0)

[[9]]
[1] "sometimes"

[[10]]
[1] "describe"

[[11]]
character(0)

[[12]]
character(0)

[[13]]
character(0)

[[14]]
character(0)

[[15]]
character(0)

[[16]]
character(0)

[[17]]
character(0)

[[18]]
character(0)

[[19]]
character(0)

[[20]]
character(0)

[[21]]
character(0)

[[22]]
[1] "language"

[[23]]
character(0)

[[24]]
character(0)

[[25]]
character(0)

[[26]]
character(0)

[[27]]
character(0)

[[28]]
character(0)

[[29]]
character(0)

[[30]]
character(0)

[[31]]
character(0)

[[32]]
character(0)

[[33]]
character(0)

[[34]]
[1] "somebody"

[[35]]
character(0)

[[36]]
character(0)

[[37]]
character(0)

[[38]]
character(0)

[[39]]
character(0)

[[40]]
character(0)

[[41]]
character(0)

[[42]]
character(0)

[[43]]
character(0)

[[44]]
character(0)

[[45]]
character(0)

[[46]]
character(0)

[[47]]
character(0)

[[48]]
character(0)

[[49]]
character(0)

[[50]]
character(0)

[[51]]
[1] "languages"

[[52]]
character(0)

[[53]]
character(0)

[[54]]
character(0)

[[55]]
character(0)

[[56]]
character(0)

[[57]]
[1] "Languages"

[[58]]
character(0)

[[59]]
character(0)

[[60]]
character(0)

[[61]]
character(0)

[[62]]
character(0)

[[63]]
[1] "communicate"

[[64]]
character(0)

[[65]]
character(0)

[[66]]
character(0)

[[67]]
[1] "Grammars"

[[68]]
[1] "developed"

[[69]]
character(0)

[[70]]
character(0)

[[71]]
character(0)

[[72]]
character(0)

[[73]]
character(0)

[[74]]
[1] "described"

[[75]]
character(0)

[[76]]
character(0)

[[77]]
character(0)

[[78]]
character(0)

[[79]]
character(0)

[[80]]
[1] "Languages"

[[81]]
character(0)

[[82]]
character(0)

[[83]]
character(0)

[[84]]
character(0)

[[85]]
character(0)

[[86]]
character(0)

[[87]]
[1] "Children"

[[88]]
character(0)

[[89]]
character(0)

[[90]]
character(0)

[[91]]
character(0)

[[92]]
character(0)

[[93]]
character(0)

[[94]]
[1] "language"

[[95]]
character(0)

[[96]]
character(0)

[[97]]
character(0)

[[98]]
character(0)

[[99]]
character(0)

[[100]]
character(0)

 [ reached getOption("max.print") -- omitted 110 entries ]
Code
# {n,m} : between n and m
str_extract_all(tokens, "^[[:alpha:]]{4,6}$") # words of 4-6 letters
[[1]]
character(0)

[[2]]
character(0)

[[3]]
character(0)

[[4]]
[1] "system"

[[5]]
character(0)

[[6]]
character(0)

[[7]]
character(0)

[[8]]
[1] "People"

[[9]]
character(0)

[[10]]
character(0)

[[11]]
character(0)

[[12]]
character(0)

[[13]]
character(0)

[[14]]
[1] "rules"

[[15]]
character(0)

[[16]]
character(0)

[[17]]
character(0)

[[18]]
character(0)

[[19]]
character(0)

[[20]]
[1] "fact"

[[21]]
character(0)

[[22]]
character(0)

[[23]]
character(0)

[[24]]
character(0)

[[25]]
character(0)

[[26]]
character(0)

[[27]]
character(0)

[[28]]
character(0)

[[29]]
[1] "word"

[[30]]
character(0)

[[31]]
character(0)

[[32]]
character(0)

[[33]]
[1] "that"

[[34]]
character(0)

[[35]]
character(0)

[[36]]
character(0)

[[37]]
[1] "rules"

[[38]]
[1] "first"

[[39]]
character(0)

[[40]]
[1] "then"

[[41]]
[1] "spoke"

[[42]]
character(0)

[[43]]
character(0)

[[44]]
[1] "like"

[[45]]
character(0)

[[46]]
[1] "rules"

[[47]]
character(0)

[[48]]
character(0)

[[49]]
character(0)

[[50]]
character(0)

[[51]]
character(0)

[[52]]
character(0)

[[53]]
character(0)

[[54]]
[1] "start"

[[55]]
[1] "like"

[[56]]
character(0)

[[57]]
character(0)

[[58]]
character(0)

[[59]]
[1] "when"

[[60]]
[1] "humans"

[[61]]
character(0)

[[62]]
character(0)

[[63]]
character(0)

[[64]]
[1] "with"

[[65]]
[1] "each"

[[66]]
character(0)

[[67]]
character(0)

[[68]]
character(0)

[[69]]
character(0)

[[70]]
[1] "After"

[[71]]
[1] "some"

[[72]]
character(0)

[[73]]
[1] "people"

[[74]]
character(0)

[[75]]
character(0)

[[76]]
character(0)

[[77]]
character(0)

[[78]]
[1] "their"

[[79]]
character(0)

[[80]]
character(0)

[[81]]
[1] "change"

[[82]]
[1] "over"

[[83]]
character(0)

[[84]]
character(0)

[[85]]
character(0)

[[86]]
character(0)

[[87]]
character(0)

[[88]]
[1] "learn"

[[89]]
character(0)

[[90]]
character(0)

[[91]]
character(0)

[[92]]
[1] "their"

[[93]]
[1] "first"

[[94]]
character(0)

[[95]]
character(0)

[[96]]
[1] "They"

[[97]]
character(0)

[[98]]
character(0)

[[99]]
[1] "need"

[[100]]
character(0)

 [ reached getOption("max.print") -- omitted 110 entries ]
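Applied to a token vector, str_extract_all() returns one list element per token, most of them empty. When you only need the matching tokens themselves, str_subset() returns them directly as a flat character vector:

```r
library(stringr)

demo_tokens <- c("Grammar", "is", "the", "system", "of", "a", "language.")

# Keep only purely alphabetic tokens of 4-6 letters
str_subset(demo_tokens, "^[[:alpha:]]{4,6}$")
# [1] "system"
```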

Greedy vs. lazy matching

By default, quantifiers are greedy — they match as much as possible. Adding ? after a quantifier makes it lazy — it matches as little as possible:

Code
html <- "<b>bold</b> and <i>italic</i>"

# Greedy: matches from first < to LAST >
str_extract(html, "<.+>")
[1] "<b>bold</b> and <i>italic</i>"
Code
# Lazy: matches from first < to next >
str_extract(html, "<.+?>")
[1] "<b>"
Code
# Extract each tag individually (lazy)
str_extract_all(html, "<.+?>")
[[1]]
[1] "<b>"  "</b>" "<i>"  "</i>"

Groups and alternation

Parentheses () create a capturing group — a sub-pattern whose match can be referenced or extracted separately. The alternation operator | means OR within a group or pattern.

Code
# Alternation: match "cat" OR "dog"
str_detect(c("I have a cat", "I have a dog", "I have a fish"),
           "cat|dog")
[1]  TRUE  TRUE FALSE
Code
# Alternation inside a group: match "colour" OR "color"
str_extract_all(c("British colour", "American color"), "colo(u|)r")
[[1]]
[1] "colour"

[[2]]
[1] "color"
Code
# Match all forms of "walk": walk, walks, walked, walking, walker
# (returns character(0) because the example text contains no form of "walk")
str_extract_all(et, "walk(s|ed|ing|er)?")
[[1]]
character(0)
Code
# Groups allow repetition of a sub-pattern
str_detect("abababab", "(ab)+")   # matches one or more "ab"
[1] TRUE

Non-capturing groups

Use (?:...) when you need to group for alternation or quantification but do not need to capture the match:

Code
# Group for alternation without capturing
str_extract_all(et, "(?:gram|morpho|phono)logy")
[[1]]
[1] "morphology"

Backreferences in replacements

Captured groups can be referred back to in replacement strings using \\1, \\2, etc.:

Code
# Swap the two words on either side of "and"
str_replace_all("cats and dogs", "(\\w+) and (\\w+)", "\\2 and \\1")
[1] "dogs and cats"
Code
# Wrap all long words in asterisks (using \\0 for the whole match)
str_replace_all(et, "\\b[[:alpha:]]{8,}\\b", "**\\0**") |>
  substr(1, 120)
[1] "Grammar is the system of a **language**. People **sometimes** **describe** grammar as the rules of a **language**, but i"

Special escape sequences

R supports shorthand escape sequences for common character classes:

Sequence   Matches                              Example (R string)
\\w        Word characters: [[:alnum:]_]        "\\w+"
\\W        Non-word characters: [^[:alnum:]_]   "\\W+"
\\d        Digits: [[:digit:]]                  "\\d+"
\\D        Non-digits: [^[:digit:]]             "\\D+"
\\s        Whitespace: [[:space:]]              "\\s+"
\\S        Non-whitespace: [^[:space:]]         "\\S+"
\\b        Word boundary (position)             "\\bcat\\b"
\\B        Non-word boundary (position)         "\\Bcat\\B"

Code
# \w: word characters
str_extract_all("price: $4.99!", "\\w+")
[[1]]
[1] "price" "4"     "99"   
Code
# \d: digits
str_extract_all("Call 07 3365 1234 or 07 3346 5678", "\\d+")
[[1]]
[1] "07"   "3365" "1234" "07"   "3346" "5678"
Code
# \s: whitespace (useful for splitting on any whitespace)
str_split("word1   word2\tword3\nword4", "\\s+")[[1]]
[1] "word1" "word2" "word3" "word4"
Code
# \b: whole-word match
str_extract_all("grammar, grammarian, ungrammatical", "\\bgrammar\\b")
[[1]]
[1] "grammar"

Lookahead and lookbehind

Lookaround assertions match a position based on what comes before or after it, without including that context in the match. They are essential for extracting values that are preceded or followed by specific markers.

Syntax     Name                  Matches
(?=...)    Positive lookahead    Position followed by ...
(?!...)    Negative lookahead    Position NOT followed by ...
(?<=...)   Positive lookbehind   Position preceded by ...
(?<!...)   Negative lookbehind   Position NOT preceded by ...

Code
prices <- c("$12.99", "$4.50", "USD 7.00", "8.95 EUR")

# Positive lookahead: match digits followed by a dot
str_extract_all(prices, "\\d+(?=\\.)")
[[1]]
[1] "12"

[[2]]
[1] "4"

[[3]]
[1] "7"

[[4]]
[1] "8"
Code
# Positive lookbehind: match digits preceded by "$"
str_extract_all(prices, "(?<=\\$)\\d+\\.\\d+")
[[1]]
[1] "12.99"

[[2]]
[1] "4.50"

[[3]]
character(0)

[[4]]
character(0)
Code
# Negative lookbehind: match numbers NOT preceded by "$"
str_extract_all(prices, "(?<!\\$)\\b\\d+\\.\\d+")
[[1]]
character(0)

[[2]]
character(0)

[[3]]
[1] "7.00"

[[4]]
[1] "8.95"

A linguistic example — extract words that come before a comma:

Code
sample_text <- "Grammar, syntax, and morphology are core subfields of linguistics."
str_extract_all(sample_text, "\\w+(?=,)")
[[1]]
[1] "Grammar" "syntax" 

Exercises: Regex Patterns

Q1. What does the regex ^[A-Z] match?





Q2. What is the difference between colou?r and colo[u]?r?





Q3. You want to match words of exactly 5 characters that consist only of lowercase letters. Which pattern is correct?






Key stringr Functions

Section Overview

What you will learn: The stringr functions used most frequently with regular expressions, and when to use each.

Key functions: str_detect(), str_count(), str_extract(), str_extract_all(), str_replace(), str_replace_all(), str_remove(), str_remove_all(), str_split(), str_locate()

The stringr package provides a consistent, user-friendly interface to regular expressions in R. All stringr functions follow the same pattern: the string comes first, the pattern second.
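Base R offers regex functions too (grepl(), sub(), regmatches()), but they take the pattern first and the string second. stringr's string-first order is what lets it slot naturally into pipelines:

```r
library(stringr)

x <- c("grammar", "syntax")

grepl("gram", x)        # base R: pattern, then string
str_detect(x, "gram")   # stringr: string, then pattern
# both return: TRUE FALSE
```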

str_detect()

Returns TRUE/FALSE for each string in a vector. Most commonly used for filtering:

Code
words_sample <- c("grammar", "syntax", "morphology", "phonology",
                  "pragmatics", "grammarian", "ungrammatical")

# Which words contain "gram"?
str_detect(words_sample, "gram")
[1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
Code
# Which words start with a vowel?
str_detect(words_sample, "^[aeiou]")
[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
Code
# Negate with !
words_sample[!str_detect(words_sample, "gram")]
[1] "syntax"     "morphology" "phonology"  "pragmatics"

str_count()

Counts non-overlapping occurrences of a pattern within each string:

Code
# How many vowels in each word?
str_count(words_sample, "[aeiou]")
[1] 2 1 3 3 3 4 5
Code
# How many times does the word "a" appear in the example text?
str_count(et, "\\ba\\b")
[1] 5

str_extract() and str_extract_all()

str_extract() returns the first match in each string. str_extract_all() returns all matches as a list:

Code
# Extract the first sequence of 3+ consonants
str_extract(words_sample, "[^aeiou]{3,}")
[1] NA     "synt" "rph"  NA     NA     NA     "ngr" 
Code
# Extract all sequences of digits from a mixed string
mixed <- c("price: 12.99 dollars", "code: A4-B12", "year: 2024")
str_extract_all(mixed, "\\d+")
[[1]]
[1] "12" "99"

[[2]]
[1] "4"  "12"

[[3]]
[1] "2024"
Code
# Extract all words longer than 8 characters from the example text
long_words <- str_extract_all(et, "\\b[[:alpha:]]{9,}\\b")[[1]]
sort(unique(long_words))
 [1] "acquisition"    "communicate"    "Computational"  "described"     
 [5] "describes"      "descriptive"    "developed"      "expressions"   
 [9] "extracting"     "grammatically"  "influences"     "instruction"   
[13] "interpretation" "intuitively"    "investigates"   "languages"     
[17] "Languages"      "linguistics"    "Linguists"      "morphology"    
[21] "naturally"      "Phonology"      "Pragmatics"     "Prescriptive"  
[25] "processing"     "searching"      "sometimes"      "structure"     
[29] "understand"    

str_replace() and str_replace_all()

Replace the first (or all) occurrence(s) of a pattern with a replacement string. Backreferences (\\1, \\2) refer to captured groups in the replacement:

Code
# Replace first match
str_replace(sent, "[csm]at", "dog")
[1] "The dog sat on the mat."
Code
# Replace all matches
str_replace_all(sent, "[csm]at", "dog")
[1] "The dog dog on the dog."
Code
# Backreference: reverse the order of two words separated by "and"
str_replace_all("cats and dogs", "(\\w+) and (\\w+)", "\\2 and \\1")
[1] "dogs and cats"
Code
# Add emphasis around all long words
str_replace_all(et, "\\b[[:alpha:]]{8,}\\b", "**\\0**") |>
  substr(1, 120)
[1] "Grammar is the system of a **language**. People **sometimes** **describe** grammar as the rules of a **language**, but i"

str_remove() and str_remove_all()

Shorthand for str_replace(x, pattern, "") and str_replace_all(x, pattern, ""):

Code
# Remove all punctuation from the sentence
str_remove_all(sent, "[[:punct:]]")
[1] "The cat sat on the mat"
Code
# Remove all digits
str_remove_all("Call us on 07-3365-1234", "\\d")
[1] "Call us on --"
Code
# Remove leading and trailing whitespace
str_remove_all("  linguistics  ", "^\\s+|\\s+$")
[1] "linguistics"
Code
# Keep only tokens of 4+ letters
long_tokens <- tokens[str_detect(tokens, "^[[:alpha:]]{4,}$")]
head(long_tokens, 10)
 [1] "Grammar"   "system"    "People"    "sometimes" "describe"  "grammar"  
 [7] "rules"     "fact"      "language"  "word"     

str_split()

Split strings on a pattern, returning a list:

Code
# Split on whitespace
str_split("the cat sat on the mat", "\\s+")[[1]]
[1] "the" "cat" "sat" "on"  "the" "mat"
Code
# Split on punctuation or whitespace
str_split("one,two; three    four", "[[:punct:]\\s]+")[[1]]
[1] "one"   "two"   "three" "four" 
Code
# Split a text into sentences (approximate)
sentences <- str_split(et, "(?<=[.!?])\\s+")[[1]]
head(sentences, 3)
[1] "Grammar is the system of a language."                                                                                             
[2] "People sometimes describe grammar as the rules of a language, but in fact no language has rules."                                 
[3] "If we use the word rules, we suggest that somebody created the rules first and then spoke the language, like the rules of a game."

str_locate()

Returns the start and end positions of matches — useful when you need to know where in the string a pattern occurs:

Code
# Find where "grammar" first occurs in the example text
str_locate(et, "grammar")
     start end
[1,]    64  70
Code
# Find all occurrences
str_locate_all(et, "\\bgrammar\\b")[[1]]
     start  end
[1,]    64   70
[2,]   442  448
[3,]   538  544
[4,]   728  734
[5,]   786  792
[6,]   847  853
[7,]  1237 1243
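The positions returned by str_locate_all() can be combined with str_sub() to build a simple keyword-in-context (KWIC) display. The sketch below assumes a fixed context window of 15 characters on each side:

```r
library(stringr)

text <- "People sometimes describe grammar as the rules of a language."

# Locate every whole-word occurrence of the keyword
pos <- str_locate_all(text, "\\bgrammar\\b")[[1]]

# Slice a window of characters around each hit, clipped to the string bounds
str_sub(text,
        start = pmax(pos[, "start"] - 15, 1),
        end   = pmin(pos[, "end"] + 15, nchar(text)))
```

For larger corpora, dedicated functions such as quanteda's kwic() are more convenient, but this pattern works well for quick checks.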

Exercises: stringr Functions

Q1. What is the difference between str_extract() and str_extract_all()?





Q2. You want to capitalise all words longer than 5 characters in a text. Which stringr function would you use?






Practical Applications

Section Overview

What you will learn: How to apply regular expressions to realistic corpus linguistics and text processing tasks.

Tasks covered: Corpus searching, text cleaning, metadata extraction, frequency analysis, and dplyr integration.

Searching a corpus: concordance-style extraction

A common corpus task is retrieving all contexts in which a pattern appears. We simulate a small multi-document corpus:

Code
corpus <- data.frame(
  doc_id   = paste0("doc", 1:10),
  register = rep(c("Academic", "News"), each = 5),
  text     = c(
    "Grammar is the systematic study of the structure of a language.",
    "Morphology examines how words are formed from smaller units called morphemes.",
    "Syntax deals with the arrangement of words to form grammatical sentences.",
    "Phonology studies the sound systems and phonological rules of languages.",
    "Pragmatics investigates how context and intention affect meaning in communication.",
    "Scientists announced a major breakthrough in natural language processing yesterday.",
    "The new grammar checker software was released to the public on Monday morning.",
    "Researchers found that bilingual speakers process syntax differently than monolinguals.",
    "Language acquisition in children follows predictable phonological and syntactic stages.",
    "The government launched a literacy program to improve grammar skills in schools."
  ),
  stringsAsFactors = FALSE
)
Code
# Find all documents containing words ending in "-ology"
corpus |>
  dplyr::filter(str_detect(text, "\\b\\w+ology\\b")) |>
  dplyr::select(doc_id, register, text)
  doc_id register
1   doc2 Academic
2   doc4 Academic
                                                                           text
1 Morphology examines how words are formed from smaller units called morphemes.
2      Phonology studies the sound systems and phonological rules of languages.
Code
# Extract all "-ology" words from each document
corpus |>
  dplyr::mutate(
    ology_words = sapply(text, function(t)
      paste(str_extract_all(t, "\\b\\w+ology\\b")[[1]], collapse = ", "))
  ) |>
  dplyr::filter(ology_words != "") |>
  dplyr::select(doc_id, ology_words)
  doc_id ology_words
1   doc2  Morphology
2   doc4   Phonology

Counting pattern frequencies

Code
# Count occurrences of "grammar" (case-insensitive) per document
corpus |>
  dplyr::mutate(
    n_grammar = str_count(text, regex("grammar", ignore_case = TRUE))
  ) |>
  dplyr::select(doc_id, register, n_grammar) |>
  dplyr::arrange(dplyr::desc(n_grammar))
   doc_id register n_grammar
1    doc1 Academic         1
2    doc7     News         1
3   doc10     News         1
4    doc2 Academic         0
5    doc3 Academic         0
6    doc4 Academic         0
7    doc5 Academic         0
8    doc6     News         0
9    doc8     News         0
10   doc9     News         0
Code
# Count how often each linguistic subfield is mentioned
subfields <- c("syntax", "morphology", "phonology", "pragmatics", "grammar")
subfield_counts <- sapply(subfields, function(sf)
  sum(str_count(corpus$text, regex(sf, ignore_case = TRUE))))

data.frame(subfield = subfields, count = subfield_counts) |>
  dplyr::arrange(dplyr::desc(count)) |>
  flextable() |>
  flextable::set_table_properties(width = .4, layout = "autofit") |>
  flextable::theme_zebra() |>
  flextable::fontsize(size = 12) |>
  flextable::fontsize(size = 12, part = "header") |>
  flextable::align_text_col(align = "center") |>
  flextable::set_caption(caption = "Frequency of linguistic subfield terms in the corpus.") |>
  flextable::border_outer()

subfield     count
grammar      3
syntax       2
morphology   1
phonology    1
pragmatics   1

Text cleaning

Regular expressions are the primary tool for cleaning raw corpus text:

Code
raw_texts <- c(
  "   Grammar  is the  system   of a language.   ",
  "Words like 'cat', 'bat', and 'hat' rhyme!",
  "Phone: +61-7-3365-1234  |  Email: info@uq.edu.au",
  "Chapter 4: Syntax (pp. 112--145) — see also §3.2",
  "The year\t2024\twas notable for advances in NLP."
)

raw_texts |>
  # Normalise whitespace (collapse multiple spaces/tabs to one)
  str_replace_all("\\s+", " ") |>
  # Remove leading and trailing whitespace
  str_trim() |>
  # Remove content in parentheses
  str_remove_all("\\(.*?\\)") |>
  # Remove section references (§3.2 etc.)
  str_remove_all("§\\d+\\.\\d+") |>
  # Remove em dashes and following spaces
  str_remove_all("—\\s*") |>
  # Trim again after removals
  str_trim()
[1] "Grammar is the system of a language."          
[2] "Words like 'cat', 'bat', and 'hat' rhyme!"     
[3] "Phone: +61-7-3365-1234 | Email: info@uq.edu.au"
[4] "Chapter 4: Syntax  see also"                   
[5] "The year 2024 was notable for advances in NLP."

Extracting structured information

A powerful application of regex is extracting structured information from free text:

Code
# Simulate file names with embedded metadata
file_names <- c(
  "speaker01_female_academic_2019.txt",
  "speaker14_male_news_2021.txt",
  "speaker07_female_fiction_2020.txt",
  "speaker23_male_academic_2022.txt"
)

# Extract each metadata component
data.frame(
  filename   = file_names,
  speaker_id = str_extract(file_names, "speaker\\d+"),
  gender     = str_extract(file_names, "(?<=_)(female|male)(?=_)"),
  register   = str_extract(file_names, "(?<=_(female|male)_)[a-z]+"),  # [a-z]+ (not \\w+): \\w would also match "_"
  year       = str_extract(file_names, "\\d{4}")
)
                            filename speaker_id gender register year
1 speaker01_female_academic_2019.txt  speaker01 female academic 2019
2       speaker14_male_news_2021.txt  speaker14   male     news 2021
3  speaker07_female_fiction_2020.txt  speaker07 female  fiction 2020
4   speaker23_male_academic_2022.txt  speaker23   male academic 2022
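Instead of calling str_extract() once per field, str_match() applies a single pattern with capturing groups and returns a matrix: column 1 holds the full match, and the remaining columns hold the captured groups:

```r
library(stringr)

file_names <- c("speaker01_female_academic_2019.txt",
                "speaker14_male_news_2021.txt")

# One pattern, four capturing groups: id, gender, register, year
str_match(file_names, "(speaker\\d+)_(female|male)_([a-z]+)_(\\d{4})")
```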

Case-insensitive matching

By default, regex in stringr is case-sensitive. Use regex(..., ignore_case = TRUE) to match regardless of case:

Code
# Match "Grammar", "GRAMMAR", "grammar" etc.
str_detect(c("Grammar", "GRAMMAR", "grammar", "GrAmMaR"),
           regex("grammar", ignore_case = TRUE))
[1] TRUE TRUE TRUE TRUE
Code
# Extract all mentions of a term regardless of capitalisation
str_extract_all(et, regex("\\bgrammar\\w*\\b", ignore_case = TRUE))[[1]]
 [1] "Grammar"  "grammar"  "Grammars" "grammar"  "Grammar"  "grammar" 
 [7] "grammar"  "grammar"  "grammar"  "grammars" "grammar" 
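
If you prefer a self-contained pattern, the ICU regex engine used by stringr also accepts the inline (?i) modifier, which switches on case-insensitive matching for the rest of the pattern:

```r
library(stringr)

# (?i) at the start of the pattern is equivalent to wrapping it
# in regex(..., ignore_case = TRUE)
str_detect(c("Grammar", "GRAMMAR", "grammar"), "(?i)grammar")
#> [1] TRUE TRUE TRUE
```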

Regex in dplyr pipelines

Regular expressions integrate seamlessly with dplyr for filtering and creating new columns:

Code
corpus |>
  dplyr::filter(str_detect(text, regex("syntax|morphology", ignore_case = TRUE))) |>
  dplyr::mutate(
    primary_topic = str_extract(text,
      regex("syntax|morphology|phonology|pragmatics|grammar",
            ignore_case = TRUE)),
    n_words       = str_count(text, "\\S+"),
    has_definition = str_detect(text, "\\bis\\b|\\bdeals with\\b|\\bexamines\\b")
  ) |>
  dplyr::select(doc_id, register, primary_topic, n_words, has_definition)
  doc_id register primary_topic n_words has_definition
1   doc2 Academic    Morphology      11           TRUE
2   doc3 Academic        Syntax      11           TRUE
3   doc8     News        syntax      10          FALSE
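
The same idea extends to multi-way annotation: pairing str_detect() with dplyr::case_when() labels each text with the first matching category. The toy data and category labels below are illustrative:

```r
library(dplyr)
library(stringr)

texts <- tibble(
  text = c("Syntax deals with sentence structure.",
           "Phonology examines sound systems.",
           "The weather was fine.")
)

# case_when() evaluates the conditions in order and assigns
# the label attached to the first condition that is TRUE
annotated <- texts |>
  mutate(topic = case_when(
    str_detect(text, regex("syntax",   ignore_case = TRUE)) ~ "syntax",
    str_detect(text, regex("phonolog", ignore_case = TRUE)) ~ "phonology",
    TRUE ~ "other"   # fallback when no pattern matches
  ))
annotated$topic
```

The topic column comes out as "syntax", "phonology", and "other" for the three texts.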

Exercises: Practical Applications

Q1. What regular expression would you use to extract all words that contain at least one digit (e.g., “A4”, “mp3”, “COVID-19”)?





Q2. You want to extract the domain name from email addresses (the part immediately after @ and before the first .). Which regex extracts uq from user@uq.edu.au?





Q3. What does str_replace_all(text, "(\\w+) and (\\w+)", "\\2 and \\1") do?






Corpus Search Exercises

Section Overview

Ten practical exercises covering the most common corpus-search regex tasks.

Each question asks you to identify the correct regular expression for a realistic search task on a tokenised text vector. All answers apply stringr functions such as str_detect() or str_extract_all() to a character vector called text.

Q1. Which regex extracts all forms of walk from a tokenised text (walk, walks, walked, walking, walker)?





Q2. Which regex extracts all words beginning with “un” (e.g., ungrammatical, unusual, undo)?





Q3. Which regex finds all numeric tokens (whole numbers like 2024, 42, 100)?





Q4. Which regex extracts all words ending in -ing (e.g., running, working, thinking)?





Q5. Which regex matches email addresses (e.g., cat@uq.edu.au, info@ladal.edu.au)?





Q6. Which regex identifies tokens that contain at least one digit mixed with letters (e.g., mp3, A4, COVID-19, type2)?





Q7. Which regex extracts hyphenated compound words (e.g., well-being, self-aware, long-term)?





Q8. Which regex finds capitalised tokens — words beginning with an uppercase letter followed by lowercase letters (e.g., proper nouns like London, Paris, Grammar)?





Q9. Which regex finds tokens that are questions ending with a question mark (e.g., you?, this?)?





Q10. Which regex finds tokens containing double vowels (e.g., agreement, book, see, moon)?






Quick Reference

Section Overview

A compact reference for the most commonly used regex elements in R.

Pattern summary table

Pattern        Meaning
.              Any character except newline
^              Start of string / line
$              End of string / line
\\b            Word boundary
\\B            Non-word boundary
[abc]          One of: a, b, or c
[^abc]         Not a, b, or c
[a-z]          Lowercase letter
[[:alpha:]]    Any letter
[[:digit:]]    Any digit
[[:punct:]]    Any punctuation
*              0 or more (greedy)
+              1 or more (greedy)
?              0 or 1 — optional (greedy)
{n}            Exactly n times
{n,}           n or more times (greedy)
{n,m}          Between n and m times (greedy)
*?             0 or more (lazy)
+?             1 or more (lazy)
{n,m}?         Between n and m times (lazy)
(abc)          Capturing group
(?:abc)        Non-capturing group
a|b            a or b
\\w            Word character [a-zA-Z0-9_]
\\d            Digit [0-9]
\\s            Whitespace
\\W            Non-word character
\\D            Non-digit
\\S            Non-whitespace
(?=...)        Positive lookahead
(?!...)        Negative lookahead
(?<=...)       Positive lookbehind
(?<!...)       Negative lookbehind

stringr function summary

Function                  Returns
str_detect(x, p)          logical vector — does p match?
str_count(x, p)           integer vector — how many matches?
str_extract(x, p)         character vector — first match (NA if none)
str_extract_all(x, p)     list of character vectors — all matches
str_replace(x, p, r)      character vector — first match replaced
str_replace_all(x, p, r)  character vector — all matches replaced
str_remove(x, p)          character vector — first match removed
str_remove_all(x, p)      character vector — all matches removed
str_split(x, p)           list of character vectors — parts between matches
str_locate(x, p)          integer matrix — start and end of first match
str_locate_all(x, p)      list of integer matrices — all match positions
str_starts(x, p)          logical — does x start with p?
str_ends(x, p)            logical — does x end with p?
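
As a quick sanity check on the summary above, a few of these functions applied to a single sentence:

```r
library(stringr)

s <- "The cat sat on the mat in 2024."

str_detect(s, "\\d{4}")             # TRUE: contains a four-digit number
str_count(s, "\\bthe\\b")           # 1: case-sensitive, so "The" is not counted
str_extract(s, "\\b\\w+at\\b")      # "cat": first word ending in "at"
str_replace_all(s, "[aeiou]", "_")  # all lowercase vowels replaced
```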


Citation & Session Info

Citation

Martin Schweinberger. 2026. Regular Expressions in R. The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia. url: https://ladal.edu.au/tutorials/regex/regex.html (Version 2026.03.28).

@manual{martinschweinberger2026regular,
  author       = {Martin Schweinberger},
  title        = {Regular Expressions in R},
  year         = {2026},
  note         = {https://ladal.edu.au/tutorials/regex/regex.html},
  organization = {The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia},
  edition      = {2026.03.28}
}
Code
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] checkdown_0.0.13 flextable_0.9.11 lubridate_1.9.4  forcats_1.0.0   
 [5] stringr_1.6.0    dplyr_1.2.0      purrr_1.2.1      readr_2.1.5     
 [9] tidyr_1.3.2      tibble_3.3.1     ggplot2_4.0.2    tidyverse_2.0.0 

loaded via a namespace (and not attached):
 [1] generics_0.1.4          fontLiberation_0.1.0    renv_1.1.7             
 [4] xml2_1.3.6              stringi_1.8.7           hms_1.1.4              
 [7] digest_0.6.39           magrittr_2.0.4          evaluate_1.0.5         
[10] grid_4.4.2              timechange_0.3.0        RColorBrewer_1.1-3     
[13] fastmap_1.2.0           jsonlite_2.0.0          zip_2.3.2              
[16] BiocManager_1.30.27     scales_1.4.0            fontBitstreamVera_0.1.1
[19] codetools_0.2-20        textshaping_1.0.0       cli_3.6.5              
[22] rlang_1.1.7             fontquiver_0.2.1        litedown_0.9           
[25] commonmark_2.0.0        withr_3.0.2             yaml_2.3.10            
[28] gdtools_0.5.0           tools_4.4.2             officer_0.7.3          
[31] uuid_1.2-1              tzdb_0.5.0              vctrs_0.7.2            
[34] R6_2.6.1                lifecycle_1.0.5         htmlwidgets_1.6.4      
[37] ragg_1.5.1              pkgconfig_2.0.3         pillar_1.11.1          
[40] gtable_0.3.6            glue_1.8.0              data.table_1.17.0      
[43] Rcpp_1.1.1              systemfonts_1.3.1       xfun_0.56              
[46] tidyselect_1.2.1        rstudioapi_0.17.1       knitr_1.51             
[49] farver_2.1.2            patchwork_1.3.0         htmltools_0.5.9        
[52] rmarkdown_2.30          compiler_4.4.2          S7_0.2.1               
[55] markdown_2.0            askpass_1.2.1           openssl_2.3.2          
AI Transparency Statement

This tutorial was re-developed with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to help revise the tutorial text, structure the instructional content, generate the R code examples, and write the checkdown quiz questions and feedback strings. All content was reviewed, edited, and approved by the author (Martin Schweinberger), who takes full responsibility for the accuracy and pedagogical appropriateness of the material. The use of AI assistance is disclosed here in the interest of transparency and in accordance with emerging best practices for AI-assisted academic content creation.


Back to HOME

References

Friedl, Jeffrey E. F. 2006. Mastering Regular Expressions. Sebastopol, CA: O’Reilly Media.
Peng, Roger D. 2020. R Programming for Data Science. Leanpub. https://bookdown.org/rdpeng/rprogdatascience/.